Loading the dataset and looking at its structure and variables
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
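The loading step that produced the output above is not shown in the knitted report. A minimal sketch of what it may have looked like follows. The file name `wineQualityWhites.csv` is an assumption, and the three-row CSV written to a temporary directory is only a stand-in (its values are the first three entries of `alcohol` and `quality` listed above) so that the snippet runs on its own.

```r
# Stand-in for the real CSV: only three columns and three rows are reproduced.
csv_path <- file.path(tempdir(), "wineQualityWhites.csv")  # hypothetical file name
writeLines(c(
  "X,alcohol,quality",
  "1,8.8,6",
  "2,9.5,6",
  "3,10.1,6"
), csv_path)

wines <- read.csv(csv_path)

summary(wines)  # per-column minimum, quartiles, mean and maximum
str(wines)      # number of observations and the type of each column
```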
Quality of wines
The wine qualities are distributed around a median of 6. The tail is slightly heavier on the lower-quality side: quality 5 is by far the second most common score after 6. No wine received a 10 or a score of 0-2, and only 5 wines have a quality of 9. The vast majority of wines have a quality of either 5 or 6.
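Counts like these can be read off a frequency table. The sketch below uses a short invented vector of scores in place of the real `wines$quality` column, and fixes the factor levels to 0-10 so that scores nobody received still show up with a count of zero.

```r
# Invented scores for illustration; the real column has 4898 entries.
quality <- c(5, 6, 6, 7, 5, 6, 8, 4, 6, 5)

# Fixing the levels keeps unused scores in the table with a count of 0.
counts <- table(factor(quality, levels = 0:10))
shares <- prop.table(counts)  # same table expressed as proportions

unused_scores <- names(counts)[as.vector(counts) == 0]
```

On the real data, `unused_scores` would be expected to list 0, 1, 2 and 10, matching the observation above.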
After running ggpairs to view the relationships between features, the plots and the correlation values indicate only weak correlation between the independent variables and wine quality. This is understandable, as it is difficult to imagine linear relationships between wine quality and, for example, the salt, sugar, or alcohol content of the wine, or its acidity.
The ggpairs output is not shown here as it renders poorly in the knitted HTML. To get a clearer view of the variables' relationships with wine quality, I will plot them as scatterplots:
Plot every candidate independent variable against wine quality, with lines for the mean and the median. Omit outliers of the independent variables, use alpha to show the spread of each variable more clearly, and add jitter to reduce overplotting.
## Warning: Removed 75 rows containing missing values (stat_summary).
## Warning: Removed 75 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 82 rows containing missing values (stat_summary).
## Warning: Removed 82 rows containing missing values (stat_summary).
## Warning: Removed 114 rows containing missing values (geom_point).
## Warning: Removed 68 rows containing missing values (stat_summary).
## Warning: Removed 68 rows containing missing values (stat_summary).
## Warning: Removed 94 rows containing missing values (geom_point).
## Warning: Removed 81 rows containing missing values (stat_summary).
## Warning: Removed 81 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 87 rows containing missing values (stat_summary).
## Warning: Removed 87 rows containing missing values (stat_summary).
## Warning: Removed 405 rows containing missing values (geom_point).
## Warning: Removed 90 rows containing missing values (stat_summary).
## Warning: Removed 90 rows containing missing values (stat_summary).
## Warning: Removed 106 rows containing missing values (geom_point).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 98 rows containing missing values (stat_summary).
## Warning: Removed 3538 rows containing missing values (geom_point).
## Warning: Removed 85 rows containing missing values (stat_summary).
## Warning: Removed 85 rows containing missing values (stat_summary).
## Warning: Removed 95 rows containing missing values (geom_point).
## Warning: Removed 84 rows containing missing values (stat_summary).
## Warning: Removed 84 rows containing missing values (stat_summary).
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 78 rows containing missing values (stat_summary).
## Warning: Removed 78 rows containing missing values (stat_summary).
## Warning: Removed 136 rows containing missing values (geom_point).
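The plotting code itself is not included in the report. The steps described above (jitter, alpha, mean and median lines, outliers omitted) can be sketched roughly as follows. The helper name `plot_vs_quality`, the 98th-percentile cutoff, and the toy data are all assumptions. Unlike the original, which apparently clipped the axis and therefore produced the warnings above, this sketch filters the rows before plotting. It assumes ggplot2 3.3.0 or later for the `fun` argument of `stat_summary()`.

```r
library(ggplot2)

# Hypothetical helper: scatter one feature against quality.
plot_vs_quality <- function(data, feature, cutoff = 0.98) {
  # Assumption: "omitting outliers" means dropping values above an upper quantile.
  upper <- quantile(data[[feature]], cutoff, na.rm = TRUE)
  trimmed <- data[data[[feature]] <= upper, ]

  ggplot(trimmed, aes(x = quality, y = .data[[feature]])) +
    geom_jitter(alpha = 0.2, width = 0.3, height = 0) +          # spread points sideways only
    stat_summary(fun = mean, geom = "line", colour = "red") +    # mean per quality score
    stat_summary(fun = median, geom = "line", colour = "blue") + # median per quality score
    labs(x = "quality", y = feature)
}

# Toy data standing in for the real `wines` data frame.
set.seed(1)
toy <- data.frame(quality = rep(4:8, each = 20),
                  alcohol = rnorm(100, mean = 10.5, sd = 1))
p <- plot_vs_quality(toy, "alcohol")
```

Setting `height = 0` keeps the jitter from moving points along the feature axis, so only the integer quality scores are spread out.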
Looking at the plots it is evident that the data is very dense at qualities of 5 and 6, and significantly less dense at other qualities.
The means and medians of the features seem quite similar (the largest outliers have been omitted from the plots, which exaggerates this). Additionally, the correlation of every feature with wine quality seems very small or nonexistent. The mean and median values change very little between quality levels, with the exception of alcohol, which decreases at first for low-quality wines and then increases roughly linearly as quality increases.
Overall, the variance in the variables seems very high in most cases. Chlorides seems to have the lowest variance for good quality wines, but funnily enough the chloride level seems very similar between good and poor wines!
To get a better estimate of the variance in the independent variables, let's create boxplots of their relationships with wine quality.
Sulphates, citric.acid and pH do not seem to affect wine quality at all: their mean and median are nearly constant and their variance is high at every quality level. Chlorides and free.sulfur.dioxide also seem to differ very little between quality levels. Therefore these variables will not be studied further in the boxplots.
Box plots of wine quality and alcohol, density, chlorides, fixed.acidity, residual.sugar, and volatile.acidity
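As a rough sketch of how one such boxplot could be produced (toy data, not the author's code): because `quality` is stored as a number, it is converted to a factor so that ggplot2 draws one box per score. The quartiles computed with `tapply()` are the numbers behind the box edges and the median line.

```r
library(ggplot2)

# Toy data standing in for the real `wines` data frame.
set.seed(1)
toy <- data.frame(quality = rep(4:8, each = 20),
                  alcohol = rnorm(100, mean = 10.5, sd = 1))

# factor(quality) gives one box per quality score instead of a single box.
p <- ggplot(toy, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot() +
  labs(x = "quality", y = "alcohol")

# Lower quartile, median and upper quartile of alcohol for each quality score.
box_stats <- tapply(toy$alcohol, toy$quality, quantile, probs = c(0.25, 0.5, 0.75))
```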
The boxplots further illustrate the problem: the features simply do not seem sufficient to explain wine quality. Most of the features have high variance among wines of good quality, and they show only weak, nonlinear correlation with quality.
The features that seem to correlate the most with wine quality are:
- alcohol, which starts a bit higher for poor wines, decreases until quality 5 and then increases roughly linearly.
- density, where better-quality wines seem less dense.
- volatile.acidity, where better-quality wines seem slightly less acidic. Wines of quality 3 are an outlier here, as they do not follow the same curve as the other qualities, but this may well be because there are only 20 wines of quality 3.
- residual.sugar, which also showed some correlation with wine quality. The relationship, however, looks clearly non-monotonic: average wines seem sweeter than both poor and good wines.
Let's look at the first three relationships further.
To get an idea of the features, let's first look at some descriptive statistics of alcohol, density, and volatile.acidity grouped by wine quality. Wines of quality 3 and 9 are omitted as there are very few of them.
## Source: local data frame [5 x 4]
##
## quality mean_alcohol median_alcohol variance_alcohol
## 1 4 10.15245 10.1 1.0064446
## 2 5 9.80884 9.5 0.7175196
## 3 6 10.57537 10.5 1.3173902
## 4 7 11.36794 11.4 1.5538515
## 5 8 11.63600 12.0 1.6387540
## Source: local data frame [5 x 4]
##
## quality mean_density median_density variance_density
## 1 4 0.9942767 0.99410 6.063201e-06
## 2 5 0.9952626 0.99530 6.475670e-06
## 3 6 0.9939613 0.99366 9.141611e-06
## 4 7 0.9924524 0.99176 7.659998e-06
## 5 8 0.9922359 0.99164 7.771407e-06
## Source: local data frame [5 x 4]
##
## quality mean_volatile.acidity median_volatile.acidity
## 1 4 0.3812270 0.32
## 2 5 0.3020110 0.28
## 3 6 0.2605641 0.25
## 4 7 0.2627670 0.25
## 5 8 0.2774000 0.26
## Variables not shown: variance_volatile.acidity (dbl)
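The `Source: local data frame` header suggests these tables were produced with dplyr. A sketch of such a grouped summary for alcohol is given below. The column names match the output above, but the toy data frame and its values are invented for illustration, and the exact filter the author used is an assumption.

```r
library(dplyr)

# Invented values standing in for the real `wines` data frame.
toy <- data.frame(
  quality = c(3, 4, 4, 5, 5, 5, 6, 6, 9),
  alcohol = c(10.0, 10.0, 10.4, 9.4, 9.6, 9.8, 10.5, 10.7, 12.5)
)

alcohol_by_quality <- toy %>%
  filter(quality > 3, quality < 9) %>%   # drop the sparsely populated scores 3 and 9
  group_by(quality) %>%
  summarise(mean_alcohol = mean(alcohol),
            median_alcohol = median(alcohol),
            variance_alcohol = var(alcohol))
```

The same pattern with `density` or `volatile.acidity` in place of `alcohol` would give the other two tables.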
The mean and median of alcohol dip from quality 4 to quality 5 and then rise with quality, at a rate that slows towards the top. The variance of alcohol is high at all quality levels, though.
The mean and median of density differ very little between quality levels. The variance is also very low.
The mean and median of volatile.acidity decrease at a slowing rate, leveling off at around 0.26. As with density, the variance seems quite low compared to other variables (although not as low as for density).
With this information, let's look more closely at the boxplots of these 3 features. The boxplots are zoomed in so that we can focus on the relationships around the mean and the middle quantiles.
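One way to zoom a boxplot in ggplot2 is `coord_cartesian()`, sketched below on toy data. Whether the author used this or axis limits is not known. The difference matters: `coord_cartesian()` only changes the visible window, whereas setting limits with `ylim()` drops out-of-range rows before the box statistics are computed, which can shift the boxes. The 5%-95% zoom window is a hypothetical choice.

```r
library(ggplot2)

# Toy data standing in for the real `wines` data frame.
set.seed(1)
toy <- data.frame(quality = rep(4:8, each = 20),
                  density = rnorm(100, mean = 0.994, sd = 0.003))
toy$density[1] <- 1.039  # one extreme value, like the maximum density in the summary

# Hypothetical zoom window: the central 90% of the observed values.
zoom <- unname(quantile(toy$density, c(0.05, 0.95)))

p_zoom <- ggplot(toy, aes(x = factor(quality), y = density)) +
  geom_boxplot() +
  coord_cartesian(ylim = zoom) +   # zooms the view; boxes are still based on all rows
  labs(x = "quality", y = "density")
```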
Alcohol vs quality
Density vs quality
Fixed.acidity vs quality
The boxplots add detail to the correlation values found earlier. The relationships of density and alcohol with wine quality shown in the boxplots appear slightly nonlinear. For relationships such as these, the Pearson correlations calculated in the ggpairs plots earlier may be misleading and underestimate the actual strength of association.
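To make the concern concrete, here is a small toy example (invented numbers, not the wine data). `y` always increases with `x`, but not along a straight line, so Pearson's coefficient falls short of 1 while Spearman's rank correlation is exactly 1.

```r
x <- 1:10
y <- exp(x)  # strictly increasing, but far from a straight line

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic association on ranks
```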
Let's try Spearman correlation instead, with the Pearson results listed first for comparison:
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
## Warning in cor.test.default(wines$quality, wines$alcohol, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
## Warning in cor.test.default(wines$quality, wines$density, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.348351
## Warning in cor.test.default(wines$quality, wines$volatile.acidity, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$volatile.acidity
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1965617
The correlation between density and wine quality is now slightly stronger (-0.348 versus -0.307). The values for alcohol and volatile.acidity changed very little, as could be expected from the boxplots.
As seen in the boxplots, the relationships with quality, especially for density, seem to be slightly non-monotonic. This breaks the monotonicity assumption behind Spearman correlation, which gives us reason to doubt the Spearman values as well.
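A toy example (invented numbers, not the wine data) shows how far this can go. In the U-shaped relationship below, `y` is completely determined by `x`, yet the Spearman coefficient is zero because the rise on one side cancels the fall on the other.

```r
x <- -5:5
y <- x^2   # U-shaped: y depends fully on x, but not monotonically

rho <- cor(x, y, method = "spearman")
```

The wine data is nowhere near this extreme, but the dip in alcohol and the bump in density around quality 5 pull the coefficients in the same direction.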
Let's try to build a linear model for wine quality using the independent variables with the largest perceived correlation.
wines$quality <- as.numeric(wines$quality)
m1 <- lm(I(quality) ~ I(volatile.acidity), data = wines)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + alcohol)
Since residual.sugar and sulphates showed some correlation, let's try adding them to the model as well:
m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + sulphates)
library(memisc)  # mtable() comes from the memisc package
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(quality) ~ I(volatile.acidity), data = wines)
## m2: lm(formula = I(quality) ~ I(volatile.acidity) + density, data = wines)
## m3: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol,
## data = wines)
## m4: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol +
## residual.sugar, data = wines)
## m5: lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol +
## residual.sugar + sulphates, data = wines)
##
## ===========================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------
## (Intercept) 6.354*** 95.245*** -36.499*** 74.225*** 96.322***
## (0.036) (3.927) (6.001) (11.977) (12.376)
## I(volatile.acidity) -1.711*** -1.639*** -2.072*** -2.059*** -2.022***
## (0.123) (0.117) (0.110) (0.109) (0.109)
## density -89.445*** 38.992*** -71.546*** -93.896***
## (3.951) (5.920) (11.923) (12.335)
## alcohol 0.399*** 0.286*** 0.261***
## (0.014) (0.018) (0.018)
## residual.sugar 0.052*** 0.061***
## (0.005) (0.005)
## sulphates 0.657***
## (0.099)
## ---------------------------------------------------------------------------
## R-squared 0.038 0.129 0.247 0.264 0.271
## adj. R-squared 0.038 0.129 0.246 0.263 0.270
## sigma 0.869 0.827 0.769 0.760 0.757
## F 192.958 362.791 534.843 438.646 362.919
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -6259.952 -6016.114 -5660.164 -5604.126 -5581.981
## Deviance 3695.351 3345.142 2892.625 2827.187 2801.737
## AIC 12525.903 12040.228 11330.329 11220.251 11177.961
## BIC 12545.393 12066.214 11362.812 11259.231 11223.437
## N 4898 4898 4898 4898 4898
## ===========================================================================
The linear model seems to explain the quality of a wine very poorly. It explains only 27.1% of the variance in wine quality.
Let's see what went wrong:
summary(m5)
##
## Call:
## lm(formula = I(quality) ~ I(volatile.acidity) + density + alcohol +
## residual.sugar + sulphates, data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3051 -0.4964 -0.0384 0.4616 3.1776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.32177 12.37637 7.783 8.60e-15 ***
## I(volatile.acidity) -2.02151 0.10859 -18.617 < 2e-16 ***
## density -93.89603 12.33453 -7.612 3.21e-14 ***
## alcohol 0.26084 0.01808 14.428 < 2e-16 ***
## residual.sugar 0.06091 0.00506 12.037 < 2e-16 ***
## sulphates 0.65729 0.09860 6.666 2.92e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7568 on 4892 degrees of freedom
## Multiple R-squared: 0.2706, Adjusted R-squared: 0.2698
## F-statistic: 362.9 on 5 and 4892 DF, p-value: < 2.2e-16
anova(m5)
## Analysis of Variance Table
##
## Response: I(quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## I(volatile.acidity) 1 145.64 145.64 254.294 < 2.2e-16 ***
## density 1 350.21 350.21 611.485 < 2.2e-16 ***
## alcohol 1 452.52 452.52 790.122 < 2.2e-16 ***
## residual.sugar 1 65.44 65.44 114.259 < 2.2e-16 ***
## sulphates 1 25.45 25.45 44.436 2.918e-11 ***
## Residuals 4892 2801.74 0.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The standard error of density is considerably higher than that of the other variables. Its Pr(>|t|) value is nevertheless very low, which leads me to believe with high confidence that all 5 independent variables do affect wine quality. The F-values and p-values of the independent variables all suggest that the variables improve the regression model over the intercept-only model.
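A large standard error does not by itself mean weak evidence; what counts is its size relative to the estimate. The sketch below recomputes the t value and two-sided p-value for density from the numbers printed by `summary(m5)`. The variable names are illustrative.

```r
# Values copied from the density row of summary(m5).
estimate  <- -93.89603
std_error <-  12.33453
df_resid  <-  4892      # residual degrees of freedom

t_value <- estimate / std_error                  # about -7.612, as in the table
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value
```

The estimate is more than seven standard errors away from zero, which is why the p-value is tiny despite the large standard error.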
Although the independent variables clearly improve the model, it still seems ill suited to explaining wine quality, as seen from the low R-squared value.
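The R-squared value can be recovered from the `anova(m5)` table: it is the share of the total sum of squares that the model terms account for. The sketch below only redoes that arithmetic with the printed numbers; the variable names are illustrative.

```r
# "Sum Sq" column of anova(m5), copied from the table above.
ss_terms <- c(volatile.acidity = 145.64, density = 350.21, alcohol = 452.52,
              residual.sugar = 65.44, sulphates = 25.45)
ss_resid <- 2801.74

ss_total  <- sum(ss_terms) + ss_resid   # total variation in quality
r_squared <- sum(ss_terms) / ss_total   # share explained by the model
round(r_squared, 4)
```

This reproduces the Multiple R-squared of 0.2706 reported by `summary(m5)`: roughly 73% of the variation in quality is left in the residuals.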
The features provided seem ill suited to explain wine quality. This is quite understandable, as the quality of wine should logically not have a simple linear relationship with features such as acidity, sweetness (sugar) or alcohol. There are numerous good wines that are sweet and numerous good wines that are not. This also explains the high variance of these features within each quality level and the fact that some features are quite similar in low-quality and high-quality wines. It all comes down to the combination of different flavors and, of course, personal preference.
The dataset was also quite limited, as almost all of the wines in it had qualities of 5-7. It would have been interesting if the dataset had contained more data on very high and very low quality wines.